gprofiler2 is the R interface to the g:Profiler toolset.

Like the web interface, gprofiler2 performs ORA with g:GOSt against multiple databases simultaneously. It supports multiple namespaces and numerous species.

1. Load input data

The dataset comprises 2 experimental groups (differentiated and undifferentiated cells in response to treatment) with 3 replicates in each group. The raw data from Pezzini et al 2016 was subjected to differential gene expression analysis with Degust and the results file saved.

You will find a copy of the DE results Pezzini_DE.txt within the Functional_enrichment_workshop_2024/day_2 folder.

First, ensure that your working directory reflects this day_2 folder:

Please ask for assistance if this step does not show the day_2 folder path!

Once your working directory is set, load the DE input file, and inspect the first few lines:

data <- read_tsv("Pezzini_DE.txt", col_names = TRUE, show_col_types = FALSE)
head(data)

The dataframe shows genes with fold change and FDR values, along with some normalised counts values for the 6 samples.

Look on the environment pane of RStudio, and you can see a description ‘14420 obs. of 10 variables’ - this shows your dataframe consists of 10 columns and 14,420 genes.

2. Get ORA gene list

Now we need to filter for DEGs, and we will apply the same thresholds as applied in day 1: adjusted P values/FDR < 0.01, and log2fold change of 2.

degs <- data %>%
  filter(FDR < 0.01 & abs(Log2FC) > 2) %>%
  pull(Gene.ID)
cat("Number of genes passing FDR and fold change filter:", length(degs), "\n")

# Save the DEG gene list to disk: 
write.table(degs, "DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

We have 792 genes passing our filters.

3. Get background gene list

Recall from the webinar and day 1 of the workshop that an experimental background gene list is crucial to avoiding false positives and minimising tissue bias.

The analysis in Degust has already removed lowly expressed genes, so we can simply extract all genes from this data matrix as our background gene list and save it as our ‘background’ object.

background <- data$Gene.ID
cat("Number of background genes:", length(background), "\n")
Number of background genes: 14420 
# Save the background gene list to disk:
write.table(background, "background_genes.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

4. Run ORA with gost function

Before running the below code chunk, review the parameters. Observe the similarities to what was available on the g:Profiler web interface, for example organism, the correction method (g:Profiler’s custom g_scs method), and domain scope.

Every R package has a user guide that outlines available parameters. Open the gprofiler2 user guide and navigate to page detailing the gost function (currently page 6).

Under the Usage subheading, all available parameters are listed, along with their default settings. You can over-ride these parameters by specifying your custom values in your R command.

The below full code shows all parameters:

ora <- gost( 
    degs, 
    organism = "hsapiens", 
    ordered_query = FALSE, 
    multi_query = FALSE, 
    significant = TRUE, 
    exclude_iea = FALSE, 
    measure_underrepresentation = FALSE, 
    evcodes = FALSE, 
    user_threshold = 0.05, 
    correction_method = "g_SCS", 
    domain_scope = "custom_annotated", 
    custom_bg = background, 
    numeric_ns = "", 
    sources = NULL, 
    as_short_link = FALSE, 
    highlight = FALSE 
)

Since we are using many of the default parameters, we could shorten this code to what is shown below. The results would be identical, however not as transparent as far as easily identifying what parameters were applied to a run.

Tip: you can run the command formals(gost) to print out all parameters and their default values. This is a generic R function so can be used on other tools outside of gprofiler2.


#ora <- gost( 
#    degs, 
#    correction_method = "g_SCS", 
#    domain_scope = "custom_annotated",
#    custom_bg = background,
#)
formals(gost)

We have now run ORA with gprofiler2! View the to-most significant enrichments with the R head command. Only significant enrichments passing your specified threshold are included in the results object we have named ora.

head(ora$result)

These are all biological process. What databases did this run ORA against?

unique(ora$result$source)

Same as the web tool, we have enrichment results for GP (BP, CC, MF), HP (human phenotype), HPA (human protein atlas), KEGG, MiRNA, Reactome, Transcription Factors and WikiPathways.

5. Save the results to a file

The results can be saved to disk to simplify sharing or local viewing outside the R environment. The output can be re-ordered to more closely match what is produced in the web tool:


ora_reordered <- ora$result[, c("source", "term_name", "term_id", "p_value", "term_size", "query_size", "intersection_size", "effective_domain_size")]

head(ora_reordered)

write.csv(ora_reordered, "gprofiler_ORA_results.csv", row.names = FALSE)

6. Visualise the results

Manhattan plots

The gostplot function creates a Manhattan plot similar to the one shown on the web tool. By setting interactive=TRUE we can hover over the data points to see enriched term details.


gostplot(ora, 
  capped = TRUE, 
  interactive = TRUE, 
  pal = c(`GO:MF` = "#dc3912", 
    `GO:BP` = "#ff9900", 
    `GO:CC` = "#109618", 
    KEGG = "#dd4477", 
    REAC = "#3366cc", 
    WP = "#0099c6", 
    TF = "#5574a6", 
    MIRNA = "#22aa99", 
    HPA = "#6633cc", 
    CORUM = "#66aa00", 
    HP = "#990099") 
)

There are a lot of significant enrichments for GO biological processes. Many of these are probably terms containing a large number of genes, so not particularly informative. Since there is no direct parameter to restrict term size to gostplot, we can filter the ORA results before plotting.

# Filter the results for GO:BP terms with term_size <= 500
ora_filter_termsize <- ora
ora_filter_termsize$result <- ora$result %>% filter(term_size <= 500)

# Plot with gostplot using the filtered results
gostplot(ora_filter_termsize, 
  capped = TRUE, 
  interactive = TRUE, 
  pal = c(
    `GO:MF` = "#dc3912", 
    `GO:BP` = "#ff9900", 
    `GO:CC` = "#109618", 
    KEGG = "#dd4477", 
    REAC = "#3366cc", 
    WP = "#0099c6", 
    TF = "#5574a6", 
    MIRNA = "#22aa99", 
    HPA = "#6633cc", 
    CORUM = "#66aa00", 
    HP = "#990099"
  )
)

gprofiler2 includes a function for creating a publication-ready image that can optionally highlight specific terms. We need to first produce a plot with interactice = FALSE, save it to an object, and then provide that plot object to the publish_gostplot function.


# Plot with gostplot using the filtered results
plot <- gostplot(ora_filter_termsize, 
  capped = TRUE, 
  interactive = FALSE, 
  pal = c(
    `GO:MF` = "#dc3912", 
    `GO:BP` = "#ff9900", 
    `GO:CC` = "#109618", 
    KEGG = "#dd4477", 
    REAC = "#3366cc", 
    WP = "#0099c6", 
    TF = "#5574a6", 
    MIRNA = "#22aa99", 
    HPA = "#6633cc", 
    CORUM = "#66aa00", 
    HP = "#990099"
  )
)

The publish_gostplot function has a highlight_terms parameter that enables you to highlight specific terms on the plot, with a table showing enrichment details below for those highlighted terms.

Let’s highlight some selected terms manually. You need to provide the term ID not term name.

#Term IDs for 'Collagen degradation' and 'Collagen formation'
highlight <- c("REAC:R-HSA-1442490", "REAC:R-HSA-1474290")

filename <- "gostplot_filter_termsize_collagen.png"

publish_gostplot(plot, 
  highlight_terms = highlight, 
  filename, 
  width = 10, 
  height = 10 )

You can use R grepl function to search for terms with names matching some keyword. Let’s highlight all terms related to receptors:


# Filter terms containing "receptor" keyword and create a list of those term IDs
highlight <- ora$result %>% 
  filter(grepl("receptor", term_name, ignore.case = TRUE)) %>% 
  pull(term_id) 

filename <- "gostplot_filter_termsize_receptors.png"

publish_gostplot(plot, 
  highlight_terms = highlight, 
  filename, 
  width = 10, 
  height = 10 )

Dotplots

If you wanted to plot each database separately as a traditional dotplot, you can loop through each database and use the R package ggplot2 to make a dotplot. We can filter the results by database and term size:

# List of databases
dbs <- unique(ora$result$source)

# Loop through each database and create a separate plo for each
for (db in dbs) {
  #db_results <- ora$result 
  db_results <- ora$result %>% filter(source == db, term_size >= 10, term_size <= 500)
  
  # Create a dot plot for the current database
  p <- ggplot(db_results, aes(x = reorder(term_name, -p_value), y = -log10(p_value))) +
    geom_point(aes(size = term_size, color = significant)) +
    labs(title = paste(db), 
         x = "Term", 
         y = "-log10(p-value)") +
    theme_minimal() +
    coord_flip()  # Flips the coordinates for better visibility
  
  # Print the plot
  print(p)
}

For plots with a lot of enriched terms, the display within the notebook is less than ideal. Saving the plot to an image file enables better resolution:


# Filter for the GO:BP database
go_bp_results <- ora$result %>% filter(source == "GO:BP",term_size >= 10, term_size <= 500)

# Create the plot for GO:BP
p_go_bp <- ggplot(go_bp_results, aes(x = reorder(term_name, -p_value), y = -log10(p_value))) +
  geom_point(aes(size = term_size, color = significant)) +
  labs(title = "GO:BP", 
       x = "Term", 
       y = "-log10(p-value)") +
  theme_minimal() +
  coord_flip()  # Flips the coordinates for better visibility


# Open a PDF device to save the plot as a full size A4: 
pdf("go_bp_plot.pdf", width = 8.27, height = 11.69)  # A4 portrait size in inches

# Print the plot to the device
print(p_go_bp)

# Close the device (this saves the plot)
dev.off()

7. Run a gost multi-query for up-regulated and down-regulated genes, and compare the results to those generated from the full DEG gene list

By providing more than one gene list and setting multi_query = TRUE, results from all of the gene lists are grouped by term IDs for easier comparison.

First, we need to re-extract our gene list so we have separate DEGs for up-regulated and down-regulated genes.

up_degs <- data %>%
  filter(FDR < 0.01 & Log2FC >= 2) %>%
  pull(Gene.ID)
cat("Number of upregulated DEGs:", length(up_degs), "\n")
Number of upregulated DEGs: 577 
# Save the DEG gene list to disk: 
write.table(up_degs, "up_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

down_degs <- data %>%
  filter(FDR < 0.01 & Log2FC <= -2) %>%
  pull(Gene.ID)
cat("Number of downregulated DEGs:", length(down_degs), "\n")
Number of downregulated DEGs: 215 
# Save the DEG gene list to disk: 
write.table(down_degs, "down_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

Now run gost as multi-query:

ora_multi <- gost( 
    query = list("upregulated" = up_degs, "downregulated" = down_degs), 
    organism = "hsapiens", 
    ordered_query = FALSE, 
    multi_query = TRUE, 
    significant = TRUE, 
    exclude_iea = FALSE, 
    measure_underrepresentation = FALSE, 
    evcodes = FALSE, 
    user_threshold = 0.05, 
    correction_method = "g_SCS", 
    domain_scope = "custom_annotated", 
    custom_bg = background, 
    numeric_ns = "", 
    sources = NULL, 
    as_short_link = FALSE, 
    highlight = FALSE 
)

And create an interactive plot with gostplot:

gostplot(ora_multi, capped = TRUE, interactive = TRUE)

To access the tabular results separately, they need to be split. View the output to see that the columns are comma separated, upregulated result first (query 1 in our list) and down-regulated second (query 2).

— WHY DOES THIS WORK IN THE CONSOLE BUT NOT THE NOTEBOK??? — WHAT IS A SANE WAY TO EXTRACT THESE RESULTS TO VIEW THE TABLE

head(ora_multi$result)

8. Compare gprofiler2 R results to the g:Profiler web results

In day 1 of the workshop, you ran ORA with g:Profiler web tool and saved the results to a CSV. Let’s compare the results to those we have generated in R. Do we expect the results to be identical or differ slightly?

# ensure the filepath matches yours. Note the "../" means one directory level below this one
web <- read.csv("../day_1/gProfiler_hsapiens_07-11-2024_11-27-09__intersections.csv")

Check the numbers: are there any terms significant frm one tool but not the other?

# Extract significant term names
web_terms <- web$term_name
ora_terms <- ora$result$term_name

paste0("Number of significant terms from web: ", length(web_terms))
paste0("Number of significant terms from R: ", length(ora_terms))

# Find command and unique terms
common_terms <- intersect(web_terms, ora_terms)
if (length(common_terms) == length(web_terms) && length(common_terms) == length(ora_terms)) {
  # If the lengths match, all terms are shared
  print("All terms are shared")
} else {
  # If there are differences, report the number of terms
  unique_web <- setdiff(web_terms, ora_terms)
  unique_ora <- setdiff(ora_terms, web_terms)
  
  print(paste("Number of terms unique to web:", length(unique_web)))
  print(paste("Number of terms unique to gprofiler2 (R):", length(unique_ora)))
}

That’s a good start! Do the P values differ by much? Let’s look closely at GO ‘Molecular Function’.


# Filter for GO:MF terms
go_mf_web <- web %>% filter(source == "GO:MF")
go_mf_r <- ora$result %>% filter(source == "GO:MF")

# Extract term names and p values
comparison_data_go_mf <- data.frame(
  term_name = go_mf_web$term_name,
  p_value_web = go_mf_web$adjusted_p_value,
  p_value_r = go_mf_r$p_value
)


# Reshape the data to long format
comparison_data_long <- comparison_data_go_mf %>%
  pivot_longer(cols = starts_with("p_value"), 
               names_to = "source", 
               values_to = "p_value")

# Create the bar plot with -log10 transformed p-values
print(ggplot(comparison_data_long, aes(x = term_name, y = -log10(p_value), fill = source)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +  # Side-by-side bars
  labs(title = "Adjusted P value comparison for GO:MF enrichments",
       x = "Term Name",
       y = "-log10(P-value)") +
  scale_fill_manual(values = c("p_value_web" = "#ff9900", "p_value_r" = "#3366cc")) +  # Custom colors
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) )  

Great! We know we applied the same parameters, and used the same input gene lists. The identical results must mean that gprofiler2 is using the same database version as g:Profiler web.

Saving versions and session details

Let’s check: yesterday when you ran ORA on the web, hopefully you saved your ‘query parameters’ as well as your results.

From my run, I can see version as ‘e111_eg58_p18_f463989d’.

Let’s report the g:Profiler database version used in our analysis:


paste0("g:Profiler database version: ", ora$meta$version)
paste0("gprofiler2 package version: ", packageVersion("gprofiler2"))

We can also capture the version of R and other session details including all loaded packages and versions with the sessionInfo() function:

sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8    LC_MONETARY=English_Australia.utf8
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.utf8    

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidyr_1.3.1            gprofiler2_0.2.3       ggplot2_3.5.1          biomaRt_2.62.0         DOSE_4.0.0            
 [6] enrichplot_1.26.2      clusterProfiler_4.14.0 org.Hs.eg.db_3.20.0    AnnotationDbi_1.68.0   IRanges_2.40.0        
[11] S4Vectors_0.44.0       Biobase_2.66.0         BiocGenerics_0.52.0    ReactomePA_1.50.0      dplyr_1.1.4           
[16] readr_2.1.5            WebGestaltR_0.4.6     

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3      rstudioapi_0.17.1       jsonlite_1.8.9          magrittr_2.0.3          ggtangle_0.0.4         
  [6] farver_2.1.2            rmarkdown_2.29          fs_1.6.5                zlibbioc_1.52.0         vctrs_0.6.5            
 [11] memoise_2.0.1           RCurl_1.98-1.16         ggtree_3.14.0           progress_1.2.3          htmltools_0.5.8.1      
 [16] curl_5.2.3              gridGraphics_0.5-1      sass_0.4.9              bslib_0.8.0             htmlwidgets_1.6.4      
 [21] httr2_1.0.6             plyr_1.8.9              plotly_4.10.4           cachem_1.1.0            whisker_0.4.1          
 [26] igraph_2.1.1            lifecycle_1.0.4         iterators_1.0.14        pkgconfig_2.0.3         Matrix_1.7-1           
 [31] R6_2.5.1                fastmap_1.2.0           gson_0.1.0              GenomeInfoDbData_1.2.13 digest_0.6.37          
 [36] aplot_0.2.3             ggnewscale_0.5.0        colorspace_2.1-1        patchwork_1.3.0         crosstalk_1.2.1        
 [41] RSQLite_2.3.7           labeling_0.4.3          filelock_1.0.3          fansi_1.0.6             httr_1.4.7             
 [46] polyclip_1.10-7         compiler_4.4.2          rngtools_1.5.2          bit64_4.5.2             withr_3.0.2            
 [51] doParallel_1.0.17       graphite_1.52.0         BiocParallel_1.40.0     viridis_0.6.5           DBI_1.2.3              
 [56] ggupset_0.4.0           apcluster_1.4.13        ggforce_0.4.2           R.utils_2.12.3          MASS_7.3-61            
 [61] rappdirs_0.3.3          tools_4.4.2             ape_5.8                 R.oo_1.27.0             glue_1.8.0             
 [66] nlme_3.1-166            GOSemSim_2.32.0         grid_4.4.2              reshape2_1.4.4          snow_0.4-4             
 [71] fgsea_1.32.0            generics_0.1.3          gtable_0.3.6            tzdb_0.4.0              R.methodsS3_1.8.2      
 [76] data.table_1.16.2       hms_1.1.3               xml2_1.3.6              tidygraph_1.3.1         utf8_1.2.4             
 [81] XVector_0.46.0          ggrepel_0.9.6           foreach_1.5.2           pillar_1.9.0            stringr_1.5.1          
 [86] yulab.utils_0.1.8       vroom_1.6.5             splines_4.4.2           tweenr_2.0.3            BiocFileCache_2.14.0   
 [91] treeio_1.30.0           lattice_0.22-6          bit_4.5.0               tidyselect_1.2.1        GO.db_3.20.0           
 [96] Biostrings_2.74.0       reactome.db_1.89.0      knitr_1.49              gridExtra_2.3           svglite_2.1.3          
[101] xfun_0.49               graphlayouts_1.2.0      stringi_1.8.4           UCSC.utils_1.2.0        lazyeval_0.2.2         
[106] ggfun_0.1.7             yaml_2.3.10             evaluate_1.0.1          codetools_0.2-20        ggraph_2.2.1           
[111] tibble_3.2.1            qvalue_2.38.0           BiocManager_1.30.25     graph_1.84.0            ggplotify_0.1.2        
[116] cli_3.6.3               systemfonts_1.1.0       munsell_0.5.1           jquerylib_0.1.4         Rcpp_1.0.13-1          
[121] GenomeInfoDb_1.42.0     dbplyr_2.5.0            png_0.1-8               parallel_4.4.2          blob_1.2.4             
[126] prettyunits_1.2.0       doRNG_1.8.6             bitops_1.0-9            viridisLite_0.4.2       tidytree_0.4.6         
[131] ggridges_0.5.6          scales_1.3.0            purrr_1.0.2             crayon_1.5.3            rlang_1.1.4            
[136] cowplot_1.1.3           fastmatch_1.1-4         KEGGREST_1.46.0        

And finally, the RStudio verson:

RStudio.Version()
$citation
To cite RStudio in publications use:

  Posit team (2024). RStudio: Integrated Development Environment for R. Posit Software, PBC, Boston, MA. URL
  http://www.posit.co/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {RStudio: Integrated Development Environment for R},
    author = {{Posit team}},
    organization = {Posit Software, PBC},
    address = {Boston, MA},
    year = {2024},
    url = {http://www.posit.co/},
  }

$mode
[1] "desktop"

$version
[1] ‘2024.9.1.394’

$long_version
[1] "2024.09.1+394"

$release_name
[1] "Cranberry Hibiscus"

End of activity summary

The last task is to knit the notebook. Our notebook is editable, and can be changed. Deleting code deletes the output, so we could lose valuable details. If we knit the notebook to HTML, we have a permanent static copy of the work.

---
title: "ORA with gprofiler2"
output: html_notebook
---


```{r Load R packages, include=FALSE}
library(readr)
library(dplyr)
library(gprofiler2)
library(ggplot2)
library(tidyr)

```


[`gprofiler2`](https://cran.r-project.org/web/packages/gprofiler2/index.html) is the R interface to the [`g:Profiler`](https://biit.cs.ut.ee/gprofiler/gost) toolset. 

Like the web interface, gprofiler2 performs ORA with `g:GOSt` against multiple databases simultaneously. It supports multiple namespaces and numerous species. 

# 1. Load input data

The dataset comprises 2 experimental groups (differentiated and undifferentiated cells in response to  treatment) with 3 replicates in each group. The raw data from [Pezzini et al 2016](https://link.springer.com/article/10.1007/s10571-016-0403-y) was subjected to differential gene expression analysis with [Degust](https://degust.erc.monash.edu/) and the results file saved. 

You will find a copy of the DE results `Pezzini_DE.txt` within the `Functional_enrichment_workshop_2024/day_2` folder. 

First, ensure that your working directory reflects this day_2 folder:


```{r setup, include=FALSE}
# Set the root directory for all notebook chunks
knitr::opts_knit$set(root.dir = "C:/Users/cwil2281/OneDrive - The University of Sydney (Staff)/Documents/PIPE-5119-enrichment-workshop/day_2")

# and set the working directory for the console:
setwd("C:/Users/cwil2281/OneDrive - The University of Sydney (Staff)/Documents/PIPE-5119-enrichment-workshop/day_2")


getwd()

```



Please ask for assistance if this step does not show the day_2 folder path! 

Once your working directory is set, load the DE input file, and inspect the first few lines: 


```{r load input data}
data <- read_tsv("Pezzini_DE.txt", col_names = TRUE, show_col_types = FALSE)
head(data)
```

The dataframe shows genes with fold change and FDR values, along with some normalised counts values for the 6 samples. 

Look on the environment pane of RStudio, and you can see a description '14420 obs. of 10 variables' - this shows your dataframe consists of 10 columns and 14,420 genes. 

# 2. Get ORA gene list

Now we need to filter for DEGs, and we will apply the same thresholds as applied in day 1: adjusted P values/FDR < 0.01, and log2fold change of 2. 

```{r get ORA gene list}
degs <- data %>%
  filter(FDR < 0.01 & abs(Log2FC) > 2) %>%
  pull(Gene.ID)
cat("Number of genes passing FDR and fold change filter:", length(degs), "\n")

# Save the DEG gene list to disk: 
write.table(degs, "DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

```
We have 792 genes passing our filters. 

# 3. Get background gene list

Recall from the webinar and day 1 of the workshop that an experimental background gene list is crucial to avoiding false positives and minimising tissue bias. 

The analysis in Degust has already removed lowly expressed genes, so we can simply extract all genes from this data matrix as our background gene list and save it as our 'background' object. 

```{r get background }
background <- data$Gene.ID
cat("Number of background genes:", length(background), "\n")

# Save the background gene list to disk:
write.table(background, "background_genes.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
```

# 4. Run ORA with `gost` function

Before running the below code chunk, review the parameters. Observe the similarities to what was available on the g:Profiler web interface, for example organism, the correction method (g:Profiler's custom `g_scs` method), and domain scope. 

Every R package has a user guide that outlines available parameters. Open the [gprofiler2](https://cran.r-project.org/web/packages/gprofiler2/gprofiler2.pdf) user guide and navigate to page detailing the `gost` function (currently page 6). 

Under the `Usage` subheading, all available  parameters are listed, along with their default settings.  You can over-ride these parameters by specifying your custom values in your R command. 

The below full code shows all parameters: 

```{r run gost}
ora <- gost( 
    degs, 
    organism = "hsapiens", 
    ordered_query = FALSE, 
    multi_query = FALSE, 
    significant = TRUE, 
    exclude_iea = FALSE, 
    measure_underrepresentation = FALSE, 
    evcodes = FALSE, 
    user_threshold = 0.05, 
    correction_method = "g_SCS", 
    domain_scope = "custom_annotated", 
    custom_bg = background, 
    numeric_ns = "", 
    sources = NULL, 
    as_short_link = FALSE, 
    highlight = FALSE 
)
```

Since we are using many of the default parameters, we could shorten this code to what is shown below. The results would be identical, however not as transparent as far as easily identifying what parameters were applied to a run. 

Tip: you can run the command `formals(gost)` to print out all parameters and their default values. This is a generic R function so can be used on other tools outside of `gprofiler2`. 


```{r abbreviated gost code }

#ora <- gost( 
#    degs, 
#    correction_method = "g_SCS", 
#    domain_scope = "custom_annotated",
#    custom_bg = background,
#)

```

```{r print defaults}
formals(gost)
```



We have now run ORA with gprofiler2! View the to-most significant enrichments with the R `head` command. Only significant enrichments passing your specified threshold are included in the results object we have named `ora`. 


```{r head ora}
head(ora$result)
```


These are all biological process. What databases did this run ORA against?


```{r gost dbs}
unique(ora$result$source)
```


Same as the web tool, we have enrichment results for GP (BP, CC, MF), HP (human phenotype), HPA (human protein atlas), KEGG, MiRNA, Reactome, Transcription Factors and WikiPathways. 



# 5. Save the results to a file

The results can be saved to disk to simplify sharing or local viewing outside the R environment. The output can be re-ordered to more closely match what is produced in the web tool: 


```{r write ora table}

ora_reordered <- ora$result[, c("source", "term_name", "term_id", "p_value", "term_size", "query_size", "intersection_size", "effective_domain_size")]

head(ora_reordered)

write.csv(ora_reordered, "gprofiler_ORA_results.csv", row.names = FALSE)


```


# 6. Visualise the results 

## Manhattan plots 

The `gostplot` function creates a Manhattan plot similar to the one shown on the web tool. By setting `interactive=TRUE` we can hover over the data points to see enriched term details. 


```{r gostplot}

gostplot(ora, 
  capped = TRUE, 
  interactive = TRUE, 
  pal = c(`GO:MF` = "#dc3912", 
    `GO:BP` = "#ff9900", 
    `GO:CC` = "#109618", 
    KEGG = "#dd4477", 
    REAC = "#3366cc", 
    WP = "#0099c6", 
    TF = "#5574a6", 
    MIRNA = "#22aa99", 
    HPA = "#6633cc", 
    CORUM = "#66aa00", 
    HP = "#990099") 
)

```


There are a lot of significant enrichments for GO biological processes. Many of these are probably terms containing a large number of genes, so not particularly informative. Since there is no direct parameter to restrict term size to `gostplot`, we can filter the ORA results before plotting.
 

```{r gostplot - filtered term size}
# Filter the results for GO:BP terms with term_size <= 500
ora_filter_termsize <- ora
ora_filter_termsize$result <- ora$result %>% filter(term_size <= 500)

# Plot with gostplot using the filtered results
gostplot(ora_filter_termsize, 
  capped = TRUE, 
  interactive = TRUE, 
  pal = c(
    `GO:MF` = "#dc3912", 
    `GO:BP` = "#ff9900", 
    `GO:CC` = "#109618", 
    KEGG = "#dd4477", 
    REAC = "#3366cc", 
    WP = "#0099c6", 
    TF = "#5574a6", 
    MIRNA = "#22aa99", 
    HPA = "#6633cc", 
    CORUM = "#66aa00", 
    HP = "#990099"
  )
)

```


gprofiler2 includes a function for creating a publication-ready image that can optionally highlight specific terms. We need to first produce a plot with `interactice = FALSE`, save it to an object, and then provide that plot object to the `publish_gostplot` function. 


```{r gostplot non-interactive}

# Plot with gostplot using the filtered results
plot <- gostplot(ora_filter_termsize, 
  capped = TRUE, 
  interactive = FALSE, 
  pal = c(
    `GO:MF` = "#dc3912", 
    `GO:BP` = "#ff9900", 
    `GO:CC` = "#109618", 
    KEGG = "#dd4477", 
    REAC = "#3366cc", 
    WP = "#0099c6", 
    TF = "#5574a6", 
    MIRNA = "#22aa99", 
    HPA = "#6633cc", 
    CORUM = "#66aa00", 
    HP = "#990099"
  )
)
```

The `publish_gostplot` function has a `highlight_terms` parameter that enables you to highlight specific terms on the plot, with a table showing enrichment details below for those highlighted terms. 

Let's highlight some selected terms manually. You need to provide the term ID not term name. 

```{r publish gostplot - collagen }
#Term IDs for 'Collagen degradation' and 'Collagen formation'
highlight <- c("REAC:R-HSA-1442490", "REAC:R-HSA-1474290")

filename <- "gostplot_filter_termsize_collagen.png"

publish_gostplot(plot, 
  highlight_terms = highlight, 
  filename, 
  width = 10, 
  height = 10 )

```


You can use R `grepl` function to search for terms with names matching some keyword. Let's highlight all terms related to receptors:

```{r publish gostplot - receptors}

# Filter terms containing "receptor" keyword and create a list of those term IDs
highlight <- ora$result %>% 
  filter(grepl("receptor", term_name, ignore.case = TRUE)) %>% 
  pull(term_id) 

filename <- "gostplot_filter_termsize_receptors.png"

publish_gostplot(plot, 
  highlight_terms = highlight, 
  filename, 
  width = 10, 
  height = 10 )

```



## Dotplots

If you wanted to plot each database separately as a traditional dotplot, you can loop through each database and use the R package `ggplot2` to make a dotplot.  We can filter the results by database and term size:


```{r ora dotplots }
# List of databases
dbs <- unique(ora$result$source)

# Loop through each database and create a separate plo for each
for (db in dbs) {
  #db_results <- ora$result 
  db_results <- ora$result %>% filter(source == db, term_size >= 10, term_size <= 500)
  
  # Create a dot plot for the current database
  p <- ggplot(db_results, aes(x = reorder(term_name, -p_value), y = -log10(p_value))) +
    geom_point(aes(size = term_size, color = significant)) +
    labs(title = paste(db), 
         x = "Term", 
         y = "-log10(p-value)") +
    theme_minimal() +
    coord_flip()  # Flips the coordinates for better visibility
  
  # Print the plot
  print(p)
}
```

 
For plots with a lot of enriched terms, the display within the notebook is less than ideal. Saving the plot to an image file enables better resolution: 


```{r save go bp plot}

# Filter for the GO:BP database
go_bp_results <- ora$result %>% filter(source == "GO:BP",term_size >= 10, term_size <= 500)

# Create the plot for GO:BP
p_go_bp <- ggplot(go_bp_results, aes(x = reorder(term_name, -p_value), y = -log10(p_value))) +
  geom_point(aes(size = term_size, color = significant)) +
  labs(title = "GO:BP", 
       x = "Term", 
       y = "-log10(p-value)") +
  theme_minimal() +
  coord_flip()  # Flips the coordinates for better visibility


# Open a PDF device to save the plot as a full size A4: 
pdf("go_bp_plot.pdf", width = 8.27, height = 11.69)  # A4 portrait size in inches

# Print the plot to the device
print(p_go_bp)

# Close the device (this saves the plot)
dev.off()

```




# 7. Run a gost multi-query for up-regulated and down-regulated genes, and compare the results to those generated from the full DEG gene list

By providing more than one gene list and setting `multi_query = TRUE`, results from all of the gene lists are grouped by term IDs for easier comparison.


First, we need to re-extract our gene list so we have separate DEGs for up-regulated and down-regulated genes. 

```{r get ORA gene lists up and down }
up_degs <- data %>%
  filter(FDR < 0.01 & Log2FC >= 2) %>%
  pull(Gene.ID)
cat("Number of upregulated DEGs:", length(up_degs), "\n")

# Save the DEG gene list to disk: 
write.table(up_degs, "up_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

down_degs <- data %>%
  filter(FDR < 0.01 & Log2FC <= -2) %>%
  pull(Gene.ID)
cat("Number of downregulated DEGs:", length(down_degs), "\n")

# Save the DEG gene list to disk: 
write.table(down_degs, "down_DEGs.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

```
Now run `gost` as multi-query: 

```{r run gost multi-query }
ora_multi <- gost( 
    query = list("upregulated" = up_degs, "downregulated" = down_degs), 
    organism = "hsapiens", 
    ordered_query = FALSE, 
    multi_query = TRUE, 
    significant = TRUE, 
    exclude_iea = FALSE, 
    measure_underrepresentation = FALSE, 
    evcodes = FALSE, 
    user_threshold = 0.05, 
    correction_method = "g_SCS", 
    domain_scope = "custom_annotated", 
    custom_bg = background, 
    numeric_ns = "", 
    sources = NULL, 
    as_short_link = FALSE, 
    highlight = FALSE 
)
```

And create an interactive plot with `gostplot`:

```{r gostplot multi}
gostplot(ora_multi, capped = TRUE, interactive = TRUE)
```

To access the tabular results separately, they need to be split. View the output to see that the columns are comma separated, upregulated result first (query 1 in our list) and down-regulated second (query 2). 


--- WHY DOES THIS WORK IN THE CONSOLE BUT NOT THE NOTEBOK???
--- WHAT IS A SANE WAY TO EXTRACT THESE RESULTS TO VIEW THE TABLE

```{r}
head(ora_multi$result)
```


# 8. Compare gprofiler2 R results to the g:Profiler web results

In day 1 of the workshop, you ran ORA with g:Profiler web tool and saved the results to a CSV. Let's compare the results to those we have generated in R. Do we expect the results to be identical or differ slightly? 


```{r import g:profiler web results}
# ensure the filepath matches yours. Note the "../" means one directory level below this one
web <- read.csv("../day_1/gProfiler_hsapiens_07-11-2024_11-27-09__intersections.csv")
```


Check the numbers: are there any terms significant frm one tool but not the other? 

```{r}
# Extract significant term names
web_terms <- web$term_name
ora_terms <- ora$result$term_name

paste0("Number of significant terms from web: ", length(web_terms))
paste0("Number of significant terms from R: ", length(ora_terms))

# Find command and unique terms
common_terms <- intersect(web_terms, ora_terms)
if (length(common_terms) == length(web_terms) && length(common_terms) == length(ora_terms)) {
  # If the lengths match, all terms are shared
  print("All terms are shared")
} else {
  # If there are differences, report the number of terms
  unique_web <- setdiff(web_terms, ora_terms)
  unique_ora <- setdiff(ora_terms, web_terms)
  
  print(paste("Number of terms unique to web:", length(unique_web)))
  print(paste("Number of terms unique to gprofiler2 (R):", length(unique_ora)))
}

```

That's a good start! Do the P values differ by much? Let's look closely at GO 'Molecular Function'. 

```{r fig.width=10, fig.height=8}

# Filter for GO:MF terms
go_mf_web <- web %>% filter(source == "GO:MF")
go_mf_r <- ora$result %>% filter(source == "GO:MF")

# Extract term names and p values
comparison_data_go_mf <- data.frame(
  term_name = go_mf_web$term_name,
  p_value_web = go_mf_web$adjusted_p_value,
  p_value_r = go_mf_r$p_value
)


# Reshape the data to long format
comparison_data_long <- comparison_data_go_mf %>%
  pivot_longer(cols = starts_with("p_value"), 
               names_to = "source", 
               values_to = "p_value")

# Create the bar plot with -log10 transformed p-values
print(ggplot(comparison_data_long, aes(x = term_name, y = -log10(p_value), fill = source)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +  # Side-by-side bars
  labs(title = "Adjusted P value comparison for GO:MF enrichments",
       x = "Term Name",
       y = "-log10(P-value)") +
  scale_fill_manual(values = c("p_value_web" = "#ff9900", "p_value_r" = "#3366cc")) +  # Custom colors
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) )  
```

Great! We know we applied the same parameters, and used the same input gene lists. The identical results must mean that gprofiler2 is using the same database version as g:Profiler web. 

# Saving versions and session details 

Let's check: yesterday when you ran ORA on the web, hopefully you saved your 'query parameters' as well as your results. 

From my run, I can see version as 'e111_eg58_p18_f463989d'. 

Let's report the g:Profiler database version used in our analysis:

```{r gprofiler versions}

paste0("g:Profiler database version: ", ora$meta$version)
paste0("gprofiler2 package version: ", packageVersion("gprofiler2"))

```

We can also capture the version of R and other session details including all loaded packages and versions with the `sessionInfo()` function: 

```{r info }
sessionInfo()
```
And finally, the RStudio verson: 

```{r rstudio v}
RStudio.Version()
```
# End of activity summary

- We have extracted a gene list and background gene list from a DE dataset and run ORA with `gprofiler2` `gost` function
- We have plotted the data with `gostplot` Manhattan plots and `ggplot2` dotplots
- We have run a `gost` multi-query separating up and down regulated genes
- We have verified that `gprofiler2` results match the results from `g:Profiler` web
- We have captured all version details relevant to the session within the R notebook

The last task is to `knit` the notebook. Our notebook is editable, and can be changed. Deleting code deletes the output, so we could lose valuable details. If we knit the notebook to HTML, we have a permanent static copy of the work. 















